法国专利FR3043222A1 METHOD FOR PARALLEL SIMULATION OF ELECTRONIC SYSTEM LEVEL WITH DETECTION OF CONFLICTS OF ACCESS TO A

专利PDF首页>>法国专利

专利附录

专利说明

权利要求

类似技术

同族专利

引用文献

法律状态

优先权

专利摘要:
A parallel electronic system level simulation method using a multi-core computer system, comprising parallel evaluation of a plurality of competing processes of said simulation on a plurality of cores of said computer system and comprising a detection sub-method conflicting access to a shared memory of a simulated electronic system, said sub-method being implemented by a simulation core executed by said computer system and comprising: a step of constructing a representative access oriented graph said memory shared by the processes evaluated by said concurrent processes; and a loop detection step in said graph; a loop being considered representative of a conflict of access to said shared memory. Computer program product for the implementation of such a method.
公开号:FR3043222A1
申请号:FR1560551
申请日:2015-11-04
公开日:2017-05-05
发明作者:Nicolas Ventroux；Tanguy Sassolas
申请人:Commissariat a lEnergie Atomique CEA；Commissariat a lEnergie Atomique et aux Energies Alternatives CEA；
IPC主号:

专利说明:

METHOD FOR PARALLEL SIMULATION OF ELECTRONIC SYSTEM LEVEL WITH DETECTION OF ACCESS CONFLICTS
The invention relates to the field of tools and methodologies for designing systems on a chip (SoC, "System on a Chip"). More precisely, it concerns the parallel implementation of high-level simulations of such systems.
A complex electronic system generally comprises a specific hardware platform and application code intended to be executed on such a platform. The system design requires, among other things, to validate the execution of the application code on the hardware platform before its final design. The costs of the design and manufacturing phase are too important to carry out several tests: the entire system must be validated before its manufacture, and this in the shortest possible time. Thus, high-level modeling tools have emerged that can model the software and hardware parts of a complex system and enable software prototyping such as architectural exploration. These tools also sometimes offer the possibility of simulating the user interfaces to accompany the application development to the final product. This is referred to as "electronic system level" (ESL) simulations.
It has become essential during the design phase to be able to use these software platforms. They facilitate the development of low-level software (drivers, operating system ...), or to do architectural exploration. Architectural exploration is an optimization phase that can be used to define the size and characteristics of the various elements belonging to the system, such as the size or type of cache memory, or for example the size of the links of the interconnection networks. More recently, ways have been added to prototyping tools to study other parameters such as energy consumption or temperature. The possibility of having all this information very early in the design flow offers many benefits directly impacting the competitiveness of the end product. It allows for example to make better architectural choices to increase performance and energy efficiency, or even to parallelize the different phases of design to significantly reduce design time.
A large majority of these software prototyping tools are based on the C / C ++ simulation library and kernel called System, as well as its extension named Transactional Level Modeling (TLM). Both of these elements are part of the IEEE 1666-2011 standard. The invention applies in particular to SystemC / TLM simulations, and will be described with reference to this particular case. It can nevertheless be applied to other electronic system level simulation techniques, in particular those using (like SystemC and TLM) a Discrete Event Simulation (DES) kernel. that is, considering that the state of the system evolves only in response to discrete events, instead of following its behavior in a continuous manner.
Outside the initialization phases, whose durations are proportionally limited, the System core is composed of 5 main and sequential phases: the system process evaluation phase, the immediate notification phase, the update phase , the delta notification phase and the time notification phase. The first 3 phases form what is called a delta cycle ("delta-cycle" in English).
The course of the simulation is illustrated by the flow chart of Figure 1.
A System process is a function or a software task describing the behavior of a part of a system module. At initialization, all System processes are executed in an arbitrary order; these processes are, in principle, competitive, that is, they represent processes that, in the real component, will be executed in parallel. During the evaluation phase, all processes in a queue are evaluated. During its evaluation, each process can write on signals or output ports (delta notification), notify an event to wake up other dependent processes (immediate notification) or generate a temporal event (time notification). An immediate notification has the effect of immediately positioning sensitive processes in the queue. The evaluation phase ends only when the entire queue has been processed, so that there is no longer a process ready to be evaluated. The update phase then takes over. This phase consists in updating all the signals and output ports that have been modified during the different successive evaluation phases. As in any hardware description language, it is important that the signal states are updated only at the end of the evaluation. Indeed, all processes are evaluated sequentially and this is the only solution ensuring that a reading of a modified signal will return the current value and not the new value, as would happen in truly competitive modules. At the end of the update phase, the delta notification phase begins. It consists in putting all the processes sensitive to the delta notifications events in the waiting list for a future evaluation. Typically, the writing of a signal will generate this type of notification and all processes sensitive to this signal will be evaluated later. If the queue is not empty (there are processes that are ready to be evaluated), the evaluation phase is restarted and all the other phases that follow, until the queue is empty after the delta notification phase; this is called the "delta cycle". Finally, if the simulation is not completed, the time notification phase takes place. It consists in checking if temporal notifications are present; if so, identify the nearest time event, update the current time of the simulation, check again if the simulation is complete and, if not, initiate a new process evaluation phase. . In general, the SystemC simulation stops when the simulation time reaches the initially requested simulation time or when no more processes are ready to execute at the end of the time notification phase.
Until a few years ago, the design of a system involved the implementation of a prototyping software solution capable of executing its application code as well as supporting architectural exploration. One and the same model served as a unique platform for software (driver, system software ...) and hardware design. These simulators incorporated the conventional design flow ensuring a unified flow from applications to hardware. This is not always possible because of the growing complexity of the systems to be designed.
A solution to this problem, which is nowadays generally used by manufacturers, is to use two different software platforms: on the one hand, software prototypes, as simulator for application development; on the other hand, material prototypes for exploration, analysis and architectural design. This separation is mainly due to the fact that hardware prototypes have become too slow for application development. On the contrary, a greater abstraction of the behavior of the platform or the temporal information makes it possible to significantly reduce the simulation times. Unfortunately, this approach has its limits. Indeed, the complexity of the systems being always more important, it will not always be possible to improve the simulation times by reducing the accuracy. In addition, the loss of information and accuracy of the models used during software development irreparably introduces errors into the design flow. Some current tools try to guarantee the compatibility of the prototypes, but this one is only possible on certain specific platforms, whose models are preconceived. This has the effect of reducing the spectrum of possible solutions.
Another approach is to maintain the unity of the platform, and to accelerate the simulations by means of efficient parallel execution of the System kernel over multiple compute cores. This approach - which is also that of the invention - is illustrated, for example, in documents [EZUD09], [SCHU10], [CHU 14],
Figure 2 illustrates, in a simplified way, the progress of a Parallel System simulation.
After initialization, which includes the build and bind phases, that is, the instantiation and dynamic creation of the model to be simulated, the kernel creates a plurality of threads ("threads"). English) which preferably are each associated with a respective heart of the computer system (logical core); the term "thread" generally refers to a group of processes that are evaluated sequentially. Then the processes of the model are distributed between the threads of execution (and thus between the logical cores). A queue is associated with each thread, and all processes that are ready to be evaluated are pushed into their queue. The order of the processes in each queue is arbitrary because the result of the simulation is supposed to be independent of that order within each evolution phase. In the following we will consider only the case where each logical heart evaluates a single thread (or none); however, this is not a fundamental limitation and the invention also covers the case where at least some logical cores can evaluate multiple threads and the case where several logical cores evaluate a single thread of execution. sequentially.
Each execution phase starts by waking the threads, numbered from 0 to N, which then evaluate the processes of their queue in parallel. The kernel stays on hold until all evaluations are complete.
The other stages of the simulation are traditional: the kernel checks if immediate notifications are present and, if so, the processes awakened by these notifications are placed in the corresponding queues; then the kernel checks if delta notifications are present and, if so, the processes woken by these notifications are placed in the corresponding queues. Then, unless the simulation is complete, the kernel checks if time notifications are present; if so, it identifies the nearest time event, updates the current time of the simulation, checks if the simulation is complete, and if not, pushes the processes awakened by the time notifications in their queues. wait.
A "classical" system simulation (performed sequentially by a single logical core, as in the case of Figure 1) is deterministic. The sequential evaluation of SystemC processes ensures that, for given inputs and for the same instance of the simulator, the result will always be the same. In other words, the behavior of a SystemC simulation is reproducible. This is important for software development and troubleshooting ("debugging") when the simulator is the runtime platform, for debugging SystemC models, or for solving race conditions. Nevertheless, SystemC has a certain indeterminism whose impact can be important according to the description of the simulator. For example, the use of primitives symbolizing "mutex" or semaphores, or the use of immediate notifications to synchronize processes, can result in non-deterministic behavior. Indeed, as indicated above, the evaluation order of the different processes during the evaluation phase is indeterminate (non-deterministic) and depends on the implementation of the SystemC model. This feature in SystemC has never been corrected by static process scheduling, since nondeterminism in SystemC has some advantages. First, it allows the modeling of non-deterministic systems as would be for example real parallel systems. Then, this makes it possible to show, according to the entries of the system, modeling errors. According to SystemC specifications, the architect using SystemC must know at the time of design that the order of process evaluation during the evaluation phase is undetermined.
In general, competition-related errors in parallel software systems can be of four types. The so-called "race-race" errors, atomic access violations, order violations and inter-blocking. These phenomena can occur when two parallel processes perform a critical section in a non-atomic way. In other words, when two parallel processes access shared, unprotected data, and at least one write intervenes. The effects range from immediate irreversible errors to silent data corruption that can lead to equally important problems. A sequential system simulation has the disadvantage, and sometimes the advantage, of hiding most of the competition-related errors, regardless of the type of communication used. Thus, most of these errors can not be modeled and resolved. The designer must wait for a run on the hardware platform to resolve the bugs related to the competition. Some work in the literature suggests modifying the scheduling of processes during the evaluation phase to address this problem and intensively studying the impact of the evaluation order on the behavior of the system.
A disadvantage of System simulations is therefore that sequential and deterministic execution leads to masking most competition errors. These simulations do not always allow a complete validation of the modeled system. For this reason, research has been conducted to identify means for detection and verification of determinism in System models.
[BLAN08] describes a new System compiler integrating a static model checking technique. This technique consists in checking algorithmically if a given SystemC model satisfies a specification. In [BLAN08], the audit aims to improve performance and to detect competitive errors. The main limitation of this approach is its inability to address large complexity issues, or using dynamic allocations or pointers. In addition, a large number of SystemC primitives can not be supported.
[HELM06] proposes to generate multiple scheduling orders to highlight competition errors. To do this, [HELM06] proposes to instrument each access to shared variables and to monitor the reads and writes, in order to generate a graph, called "dependencies", between the processes during the execution of the simulator. This graph is then used to generate, via a partial-order dynamic reduction technique, all possible scheduling that may reveal competition errors. The number of possible scheduling remains nevertheless very important and this technique does not make it possible to address large systems.
[SEN08] proposes to instrument the SystemC models to detect all dependencies during simulator execution using a vector logical time technique, then to formally analyze the generated execution traces against predetermined assertions . These traces give a detailed information of the different partial orders obtained during each evaluation phase and then make it possible to detect possible interlockings or competition errors. These jobs do not support TLM communications.
"Test-based" approaches consist in generating several possible orderings and in comparing the results produced to detect possible problems of determinism. For example, [LE14] uses a transformation of the SystemC model into a completely static model in C; the processes then communicate with each other only via shared variables. Then two orderings of the processes are generated, one completely static and ordered, and the other whose evaluation order is undetermined. The difference in results is analyzed to highlight problems of determinism. On the contrary, [HERR06] proposes to generate a random ordering and to execute multiple simulation instances to detect possible errors. Finally, [SCHU12a] proposes a slightly different approach by recording different observable quantities such as process activation, stack allocation, standard output, use of random numbers, and so on. The successive execution of simulations by comparing these quantities allows the detection of competition errors. The main disadvantage of these techniques lies in the need to perform several successive executions and thus increase the simulation time.
The "naïve" parallelization of SystemC induces other competition-related difficulties. Indeed, without particular precaution, such a parallelization inevitably leads to errors related to competition, and therefore to a non-deterministic behavior of the simulation. This is not compatible with the determinism required by SystemC. Parallelizing the SystemC simulation kernel is a difficult problem because such a parallel kernel should not generate a different behavior than a sequential implementation. Competence conditions ("race condition"), the order and the temporal instants of activation of the processes, and the implementation of all the previously described mechanisms must be respected. The difficulty lies in the implementation of such a parallelization to ensure a high level of performance and the determinism of the simulations. This problem (how to parallelize SystemC while respecting the condition of determinism) has been addressed in the following publications: [CHEN12] proposes to detect the occurrence of competition-related errors by a static analysis of the SystemC process code, capable of detect dependencies between processes, focusing on transitions (that is, the code executed between two calls to the "wait" primitive). With this information, it is possible to define a process scheduling that guarantees the non-generation of competition errors by sequentialising the dependent processes. Such a static approach is very cumbersome and leads to the generation of solutions that can not be parallelized with the use of TLM communications. Indeed, all processes accessing a memory element in TLM are executed sequentially.
[EZUD09] uses synchronization mechanisms based on the declaration of OpenMP critical sections to protect shared kernel variables that must be accessible by system processes that are evaluated in parallel. These techniques prohibit the use of shared variables or use particular synchronizations to access them, requiring the user to guarantee the atomicity of memory accesses and communications.
[MELL10] exploits the fact that ESL simulations are generally discrete events and, therefore, do not use a clock, to allow, in the context of a speculative approach, certain temporal violations of execution through to the addition of certain logical mechanisms. System processes have a local time that is synchronized with other elements of the system at each communication or whenever a predefined time threshold (also called quantum) is reached. Thus the temporal violation can not exceed the duration of a quantum and the synchronization constraint, which is penalizing for the performances, is released. This requires strongly constraining the description of the model and prohibits certain uses of SystemC. This approach is therefore not usable in a general way and above all is very expensive because of the use of speculation mechanisms.
[MOY13] also proposes to not strictly respect the temporal scheduling of SystemC processes and to use a particular library to declare the processing of dependent processes, strongly constraining the way of describing models in SystemC. Each process in a process is associated with an execution time. A static analysis then makes it possible to define a total order of the processes and to define which ones can be executed in parallel without risk of conflict. This mechanism works only if the dependent processes can be out of sync and can not be generalized in the case of TLM simulations whose set of processes may need to access randomly a shared resource such as memory.
[SCHU13] proposes mechanisms to resolve access conflicts of parallel and dependent processes and guarantee the atomicity of simulations. Simulations based on communications using the TLM 2.0 extension are here supported. The first mechanism is to create process groups that will be evaluated on independent threads. When a process wants to communicate with another process belonging to another group, then this one is dynamically migrated with its context to the other group. This is done during the evaluation phase and guarantees the atomicity of the execution of the different processes. All communications must then go through TLM interfaces. For direct communications via global or static variables, special manual precautions must be taken by the designer if different group processes can access them. All modules in a group can access member functions of other modules in the same group. In order to allow the predictability and determinism of the simulations, an ordered execution queue is used for each group. This ensures that the dependent processes always run in the same order at each evaluation phase. The major disadvantage of this approach is the migration of processes to other threads. Moreover, often in a System simulation, the target of a transaction is not a System process but a simple function and it is therefore not possible with this method to guarantee the atomicity of its accesses. Finally, the integration of dependent processes in the same group limits the possible parallelization and therefore the performances.
In conclusion, System's sequential (conventional) implementations have the disadvantage of masking competition errors, whereas parallel implementations can highlight these errors, but at the cost of loss of determinism, which is not acceptable. . Several solutions, which have just been described, have been proposed to overcome these drawbacks, but none gives complete satisfaction. The invention aims to overcome these disadvantages of the prior art. More specifically, it aims to allow simulations of electronic system level executed in parallel on a multi-core computer system, respecting the condition of determinism while highlighting the situations likely to lead to competition errors.
According to the invention, this objective is achieved by constructing, during the parallel execution of an electronic system level simulation, a dependency graph and its dynamic exploitation to detect determinism errors. atomicity or to "replay" deterministically a previous execution of the simulation.
According to various embodiments, the invention exploits the dynamic creation of dependency graphs defining a partial order between the processes executing in parallel during an evaluation phase, the generation of an ordered list of processes to be executed sequentially in the same evaluation phase and the use of the ordered list to sequentially sequence processes accessing memory areas to be atomic. The use of these graphs makes it possible to check the atomicity properties of the evaluation phase by the complementary use of a strongly related component search algorithm. In case of detection of a competition error, it is then possible to go back and re-run the evaluation phase in a partially sequential order (this is however heavy to set up), or to modify the model to prevent the interweaving effect of processes. Finally, saving all the graphs after their serialization can allow a later use to reproduce the same order of evaluation of the processes within each evaluation phase and to obtain a deterministic and predictable model for the debugging.
An object of the invention is therefore an electronic system level parallel simulation method implemented by means of a multi-core computer system, said method comprising the parallel evaluation of a plurality of competing processes of said simulation on a plurality of logical cores of said computer system, said concurrent processes being grouped together in execution threads, the competing processes of the same thread being evaluated sequentially by the same logical heart of the system, the method being characterized in that it comprises a sub-method for detecting conflicts of access to a shared memory of a simulated electronic system, said sub-method being implemented by a simulation kernel executed by said computer system and comprising: a construction step of a representative graph of access to said memory shared by the evaluated processes by said competing processes; and a loop detection step in said graph; a loop being considered representative of a conflict of access to said shared memory.
According to particular embodiments of the invention:
Said oriented graph may comprise nodes, each representing a thread, and arcs each connecting two nodes, each arc representing an execution order relationship between the two said execution threads (ie between at least one process of each of the two threads).
The method may comprise a plurality of concurrent process evaluation phases corresponding to successive simulation times, wherein said steps of the access conflict detection sub-method are implemented after each said evaluation phase.
The method may also include: during said parallel evaluation of a plurality of concurrent processes, a step of monitoring at least one zone of said shared memory of the simulated electronic system previously declared as critical, and pre-empting all the processes belonging to to the same thread that a process having tried to access said zone or zones after a first access by another process belonging to another thread; and after said parallel evaluation, a step of sequentially evaluating the preempted processes in the previous step.
Said step of constructing an oriented graph may comprise: constructing a plurality of partial oriented graphs, each representative of access to a subset of said shared memory; and merging these graphs to form a single oriented graph. In this case, and if said shared memory of the simulated electronic system comprises a plurality of memory locations grouped into a plurality of pages, said construct of a plurality of partial oriented graphs may comprise: constructing a plurality of partial oriented graphs , each representative of access to a respective memory lease; the fusion of the partial oriented graphs corresponding to the memory locations of the same page. Moreover, the construction of each said partial oriented graph may comprise: during each read access to said subset of the shared memory, the identification and the memorization of an identifier of the thread of execution of which a process has performed said reading access; during each write access, the creation of: a node, called current node, representing the execution thread of which a process has made said write access; Nodes, said previous reader nodes, representatives of the threads whose at least one process has read access to said subset of the shared memory before said write access but after any previous write access, if least such a thread exists; Oriented arcs linking each said previous drive node to said current node, if at least one such node exists; and • if the write access is not the first one: o oriented arcs connecting a node, called the preceding write node, representing the thread of a process that performed the immediately preceding write access to the nodes previous readers, if at least one such node exists; o or in arc connecting said write node to said current node if no said previous drive node exists.
In this case, the memorization of the identifier of the thread of which a process has performed said read access can be performed by means of at least one vector associated with each thread, each said vector comprising as many Boolean elements that there are subsets of said shared memory, each element: being associated with a said subset, having a first binary value by default and taking a second binary value, opposite to the previous one, when a process belonging to said thread performs access to the subset of said shared memory associated with said element.
The access conflict detection sub-method may also include, if no conflict is detected, a step of determining a first, ordered list of execution threads to be evaluated sequentially, and a second list of execution threads that can be evaluated in parallel, this step being implemented by linearization of said oriented graph. In this case, the method may also include a re-execution of said simulation, said re-execution comprising: a parallel evaluation phase of the execution thread belonging to said second list; and a sequential evaluation phase of the threads belonging to said first list.
Another object of the invention is a computer program product comprising computer executable computer code, stored on a computer-readable (preferably non-volatile) medium and adapted to implement such a method. Other features, details and advantages of the invention will emerge on reading the description made with reference to the accompanying drawings given by way of example and which represent, respectively: FIG. 1, a flow chart of a system level simulation electronics sequentially performed in SystemC, in accordance with the prior art; FIG. 2, a flow chart of an electronic system level simulation performed in parallel but without detection of access conflicts; FIGS. 3A to 3E, respectively, a possible implementation of a data structure for storing the read accesses, the constitution of a dependency graph, a possible implementation of the graph, the detection of loops and an example of linearization of such a graph; FIG. 4 is a flow chart of the step of creating a dependency graph; FIG. 5 is a flowchart illustrating the implementation of a mechanism for protecting critical areas of the shared memory; FIG. 6 is a flow chart of a method according to one embodiment of the invention; and FIG. 7, a flowchart of a method according to another embodiment of the invention, allowing the deterministic reproduction of a parallel simulation.
As mentioned above, the invention aims in particular to detect the problems of parallel access to shared resources, which can occur during the execution of a parallel simulation, to force the sequential execution of the processes strongly. dependents to ensure the atomicity of their evaluation, and / or to make a simulation fully predictable to allow the search for bugs and the development of the simulator and applications by reuse of the total order of a previous execution.
To do this, a method according to the invention dynamically creates a dependency graph between the processes for each address used and during each evaluation phase. This is to verify that the System processes, although evaluated in parallel, are evaluated in a partial order with respect to all these addresses. The structure of the graph is composed of nodes whose value corresponds to the identifier of a thread of execution, and oriented arcs connecting these nodes. Each arc corresponds to an order of execution between the nodes. A graph is not created if no write occurs as there can not be a competition error if all accesses are read, but all reads are recorded in case a write occurs before the end of the evaluation phase. Thus, a new node is created at each new write or read if a write has already occurred. At each new write, the graph is also extended with all previous readings. This graph detects all non-commutative actions on all variables and shared spaces in the model that can lead to atomic conflicts in the process evaluation. Thus it is important to detect a reading or a writing if it is followed by a modifying writing, and a modifying writing if it is followed by a reading.
Reading access is the most numerous, so it is important that their supervision is as effective as possible. A read access monitoring technique is illustrated in Figure 3A. A set of VSA vectors, for example 64 elements, is associated with each thread of execution, and therefore with each logical heart. The elements of each vector are binary, or boolean, and can therefore consist of a single bit, each representing an address of the shared memory; thus a simple 64-bit word represents 64 addresses. Resolution change techniques may increase the size of this range; in this case, each element of the vector represents a group of memory addresses. Initially, each element of each vector has the same binary value, for example Ό '. When a process accesses read from an address of a memory, the corresponding element of the VSA vector associated with the thread that executes the process takes the opposite value, here Ύ. This makes it easy to find for each address, or set of addresses, the shared memory, the threads of which at least one process has accessed read.
The process for creating the graph, implemented by the simulation kernel, is as follows; this process is illustrated by the flow chart of Figure 4.
When access to a memory lease is detected, the identifier of the thread of the origin of the access is retrieved; then it is checked whether it is a write or read access. If it is a read, it is saved in the corresponding VSA read vector. Otherwise, we check if the graph already exists and a new graph is created dynamically if necessary. Then, we look if readings have taken place previously at this address. If so, then the identifiers of the execution threads, and therefore the logical cores, read to this address are used for the creation of new nodes ("reader nodes") in the graph. Arcs are created between the last created write node and these new drive nodes, then between those drive nodes and a new node representing write access present. If no, we test if a write has occurred previously and an arc is created with the last node. Finally, the last added node is updated with the identifier of the current thread. At the end of the parallel evaluation phase, this graph is finalized. Nodes representing all the threads whose process has read a given address are added to the end of the graph if it is not empty. Before starting the successive parallel evaluation phase, the dependency graphs and the VSA read vectors are reset.
FIG. 3B illustrates the creation of a dependency graph GD for an address "A" corresponding to the following memory access succession: a process of the execution thread W1 ("W" of the English "worker", c 'i.e.' worker ') reads memory element A, - thread process W2 accesses read A, - process thread W3 accesses read A a process of thread W4 accesses write to A, a process of thread W5 accesses write to A, a process of thread W6 accesses read to A, a process of Execution thread W6 accesses write to A, - A process of thread W7 accesses read to A.
Figure 3C illustrates a data structure that can be used to represent a dependency graph. This structure is in the form of a vector, the size of the number of nodes of the graph, whose elements ID are, or point to, lists LA of following nodes.
There is one graph per address or group of addresses that are handled unitarily (resolution). A node has a value that matches the thread ID that is the source of the memory access; the value "-1" is assigned to the elements of the vector that correspond to threads that have no successor in the graph. At the end of each evaluation phase, all the graphs are merged to obtain a global graph representative of all the shared memory resource access dependencies of the model for this evaluation phase. To check the absence of an atomic conflict is to check the absence of cycles in the global dependency graph. It is then sufficient to detect if the graph has strongly related components. This can be done for example through the use of a Tarjan algorithm. If a conflict is detected, the simulation is stopped and the nature of the conflict is provided to the user. With this information, the designer can correct the application running on the model and protect the critical sections, define the offending addresses to be accessed atomically, or for example allocate differently the processes causing the conflict. access to logical cores so that these processes are evaluated sequentially.
In order to improve the efficiency of the method, it may be advantageous to carry out the construction of the dependency graphs and the conflict detection in a hierarchical manner by considering that the shared memory is formed of pages, each comprising a plurality of memory addresses. . Thus, for example, one can first detect strongly related components in each individual graph; then, if no conflict is detected at this level, merge the graphs corresponding to memory locations belonging to the same page and proceed to the detection of strongly related components in these graphs of higher level, finally merge all these graphs in a graph globally and perform one last time a detection of strongly related components. The number of levels in such a hierarchical approach may, of course, be different from three.
Figure 3D shows a dependence graph GD having a strongly connected component CFC constituted by the nodes W1, W4 and W5 and by the arcs which connect them, which form a loop. The graph corresponds to the following memory access sequence: a process of the execution thread W1 accesses reading A, a process of the thread of execution W2 accesses reading A, a process of the thread of execution W3 accesses read A, - a thread process W4 write accesses A, - a thread process W5 write accesses A, - the thread process W1 accesses again read at A, - a process of thread W6 accesses write to A.
It is easy to verify that the second read access by W1 causes an atomicity violation.
If the global dependency graph is acyclic, a simple linearization can be performed in order to obtain, for each evaluation phase, the list of execution threads (and therefore processes) to be executed in parallel and the ordered list of execution threads to execute sequentially. Serialization can be obtained via a graph topological sort algorithm. Figure 3E illustrates a situation where the maximum number of threads is 63 (a 64-core computer system, one of which is dedicated to the execution of the simulation kernel) where the execution threads 1 to 4 perform accesses. memory that do not give rise to conflict. The corresponding graph is shown on the left side of the figure; on the right are represented: an ordered list NS, obtained by linearization of the graph, containing the threads of execution which must be evaluated in a sequential way (one will notice that the indicated order -1,2,3,4- is not not the only one possible, the order 1,3, 2, 4 could also have been chosen); an NP list of threads (0 and 5 to 62) that can be evaluated in parallel (a sequential execution is of course possible, in which case the order does not matter). It should be emphasized that, if the threads are evaluated in parallel by different logical cores, the processes of each thread are necessarily evaluated sequentially.
It will also be noted that thread 1 could be assigned to the NP list if the threads of NP are re-evaluated before those of NS, or thread 4 can be assigned to the NP list. in the case where the threads of NP will be re-evaluated after those of NS. This solution makes it possible to increase the parallelism of the NP evaluation and to reduce the duration of the NS evaluation.
These two lists, or at least the list NS, can be saved in a file, called execution file, with an indication of the time of the simulation and the iteration number of the corresponding evaluation phase. This file is necessary to allow a deterministic re-execution of the simulation, as will be explained later with reference to FIG.
As was mentioned above with reference to VSA read access monitoring vectors, it is possible to improve the efficiency of the process by acting on its "granularity", that is to say by considering several locations of memory as a single location, to which is assigned a single element of each VSA vector and a single dependency graph. The disadvantage is that we then cause "false positives", that is to say that we detect violations of atomicity that are actually non-existent.
According to a preferred embodiment of the invention, zones of the shared memory can be defined as being "critical", and thus be subject to special surveillance in order to force the atomicity of their accesses upstream. and avoid the occurrence of access conflict, instead of just detecting it a posteriori using dependency graphs. Such a monitoring mechanism is illustrated by the flow chart of FIG.
When requesting access to an address in the shared memory, the kernel checks whether this address belongs to a critical region. If so, it checks whether this is the first access to this address during the current evaluation phase. If this is the case, an identifier of the thread at the origin of the request is saved in a data structure, and access is allowed. It is the same for all subsequent accesses by processes of this same thread (which is determined by the identifier previously saved), which are treated as "first" access. Otherwise, all processes belonging to the same thread as the one that made the access attempt are preempted and pushed into a queue; a node corresponding to the preempted process is nevertheless created in the dependency graph. The processes contained in the queue will be executed sequentially at the end of the current evaluation phase. The flowchart of FIG. 6 illustrates, in a simplified manner, the progress of a parallel system simulation according to one embodiment of the invention. This flowchart is similar to Figure 2, but has two additional steps: At the end of the parallel evaluation phase, the simulation kernel detects potential shared memory access conflicts using the dependency graphs. , and if such conflicts are detected the simulation stops; if no access conflict has been detected, the kernel proceeds with the sequential evaluation of the processes that were pre-empted during the parallel evaluation phase because they requested access to an address of a critical zone of the memory to which another process had already had access. This sequential evaluation order respects the partial order obtained by the global dependency graph obtained at the end of the parallel evaluation.
In addition, as mentioned above, the dependency graphs make it possible to construct a list of execution threads that can be executed in parallel and those that must execute sequentially during each evaluation phase. In one embodiment, the evaluation phase is identified by the current system simulation time and the evaluation phase number during this SystemC time. This information is saved in the execution file. No information is recorded in the file for times when no dependency was observed. This file can then be used during another run of the same model ("replay mode") in order to recreate the exact behavior and total order of the processes. This makes it possible to guarantee the determinism of the simulations. The flowchart of FIG. 7 illustrates the progress of the evaluation phase according to one embodiment of the invention, including the possibility of "replaying" a simulation already performed. At the beginning of each evaluation phase, a counter is incremented. If the simulation is to be performed in "replay" mode, the kernel checks whether the current execution time and the counter value correspond to information stored in the execution file. If yes, the NP list of threads that can be run in parallel to the time and for the evaluation phase in question is extracted from the file. These threads are then executed in parallel. Then, if the simulation is not to be performed in "replay" mode, we proceed to the detection of access conflicts and the update of the execution file. In "replay" mode this step is not necessary, and one simply proceeds to extract the NS list from said file. Then, in both cases, the threads of said list are evaluated sequentially,
The simulation ends when a conflict is detected, if the replay scheduling list is empty, or after the completion of the sequential evaluation. The invention has been described in relation to particular embodiments, but numerous variants are conceivable. In particular, several optimizations are possible. For example :
To improve performance, parallel graph creations can be made; this is made possible by a partitioning of the memory space in blocks, which makes it possible to access, create and modify the graph structures in parallel by addressing blocks and then to merge them during the detection phase, by example with a merge tree.
It is possible to previously declare read-only memory spaces to avoid monitoring these addresses. Reducing the space to monitor can significantly improve performance.
It is possible to add the first level of nodes of each dependency graph (respectively the last) in the list of execution threads that can be run in parallel if, during a deterministic replay, the parallel evaluation phase is evaluated before (respectively after) the sequential phase.
Bibliography: [MOY13] M. Moy, Parallel Programming with System for Loosely Timed Models: A Non-Intrusive Approach, DATE 2013.
[CHEN12] W. Chen, X. Han and R. Dômer, "Out-of-order Parallel Simulation for ESL Design", in DATE 2012.
[MELL10] A. Mello, I. Maia, A. Greiner, and F. Pecheux, Parallel Simulation of SystemC TLM 2.0 Compiling MPSoC on SMP Workstations, DATE 2010.
[SCHU13] C. Schumacher et al., LegaSCi: Legacy SystemC Model Integration into Parallel SystemC Simulators, IPDPSW 2013. [EZUD09] P. Ezudheen et al., Parallelizing SystemC kernel for fast hardware simulation on SMP machines, PADS 2009.
[BLAN08] N. White and D. Kroening, Race Analysis for SystemC using Model Checking, In IEEE / ACM International Conference on Computer-Aided Design (ICCAD), Piscataway, USA, 2008, pp. 356-363.
[HELM06] C. Helmstetter, F. Maraninchi, L. Maillet-Contoz and M. Moy, Automatic Generation of Schedules for Improving the Testing of Systems-on-a-Chip, In Formal Methods in Computer Aided Design (FMCAD) , Washington, DC, USA, 2006, pp. 171-178.
[SEN08] A. Sen, V. Ogale and M.S. Abadir, Predictive run-time verification of multiprocessor SoCs in SystemC, In DAC, New York, USA, 2008, pp. 948-953.
[LE14] H. M. Le Drechsler R., Towards Verifying Determinism of SystemC Designs, in DATE, March 2014.
[SCFIU12a] C. Schumacher, J.H. Weinstock, R. Leupers and G. Ascheid, SCANDAL: SystemC analysis for nondeterministic anomalies, in Forum on Specification and Design Languages (FDL), Vienna, Austria, 2012.
[HERR06] F. H errera and E. Villar, Extension of the SystemC kernel for simulation coverage, In Forum on Specification and Design Languages (FDL), 2006, pp. 161-168.
[SCHU10] C. Schumacher, R. Leupers, D. Petras and A. Hoffmann "parSC: Synchronous Paralle! SystemC Simulation on Multi-Core Host Architectures ", in CODES + ISSS, pages 241-246, Scottsdale, Arizona, USA, October 2010.
[CHU14] M.-K. Chung, J.-K. Kim, and S. Ryu "SimParalIel: A High
Performance Parallel SystemC Simulator Using Hierarchical Multi-threading ", International Symposium on Circuits and Systems (ISCAS), Melbourne, Australia, June 2014.

权利要求:
Claims (11)
[1" id="c-fr-0001]
An electronic system level parallel simulation method implemented by means of a multi-core computer system, said method comprising parallel evaluation of a plurality of competing processes of said simulation on a plurality of logical cores of said computer system. , said concurrent processes being grouped together in execution threads, the concurrent processes of the same thread being evaluated sequentially by the same logical heart of the system, the method being characterized in that it comprises a subprocess of detecting conflicts of access to a shared memory of a simulated electronic system, said sub-method being implemented by a simulation kernel executed by said computer system and comprising: a step of building an oriented graph (GD ) representative of access to said memory shared by the processes evaluated by said processes competitors; and • a loop detection step (CFC) in said graph; a loop being considered representative of a conflict of access to said shared memory.
[2" id="c-fr-0002]
The method of claim 1 wherein said oriented graph comprises nodes (W1-W7), each representing a thread, and arcs each connecting two nodes, each arc representing an execution order relationship between the nodes. two so-called threads of execution.
[3" id="c-fr-0003]
The method according to one of the preceding claims comprising a plurality of concurrent process evaluation phases corresponding to successive simulation times, wherein said steps of the access conflict detection sub-method are implemented after each said evaluation phase.
[4" id="c-fr-0004]
4. Method according to one of the preceding claims also comprising: during said parallel evaluation of a plurality of competing processes, a step of monitoring at least one zone of said shared memory of the simulated electronic system previously declared as critical, and preempting all the processes belonging to the same thread that a process having tried to access said one or each said zone after a first access by another process belonging to another thread; and after said parallel evaluation, a step of sequentially evaluating the processes preempted in the previous step.
[5" id="c-fr-0005]
5. Method according to one of the preceding claims wherein said step of constructing an oriented graph comprises: the construction of a plurality of partial oriented graphs, each representing access to a subset of said shared memory; and • merging these graphs to form a single oriented graph.
[6" id="c-fr-0006]
The method of claim 5 wherein said shared memory of the simulated electronic system comprises a plurality of memory locations grouped into a plurality of pages, and wherein said constructing a plurality of partial oriented graphs comprises: • the construction of a plurality of partial oriented graphs each representative of access to a respective memory location; • the fusion of partial oriented graphs corresponding to the memory locations of the same page.
[7" id="c-fr-0007]
7. Method according to one of claims 5 or 6 wherein the construction of each said partial oriented graph comprises: • at each read access to said subset of the shared memory, the identification and storage of an identifier the thread of a process that has performed said read access; During each write access, the creation of: a node, called current node, representing the execution thread of which a process has made said write access; o nodes, said previous reader nodes, representatives of the threads whose at least one process has performed read access to said subset of the shared memory before said write access but after any previous write access, if at least such a thread exists; o oriented arcs connecting each said previous drive node to said current node, if at least one such node exists; and o if the write access is not the first: "oriented arcs connecting a node, said previous write node, representing the thread of a process that performed the immediately preceding write access to said nodes previous readers, if at least one such node exists; or an arc connecting said write node to said current node if none of said preceding drive node exists.
[8" id="c-fr-0008]
8. The method of claim 7 wherein storing the identifier of the thread of which a process has performed said read access is performed by means of at least one vector (VSA) associated with each thread, each said vector comprising as many Boolean elements as there are subsets of said shared memory, each element: • being associated with a said subset, • having a first binary value by default and • taking a second value binary, opposite to the previous one, when a process belonging to said thread of execution makes an access to the subset of said shared memory associated with said element.
[9" id="c-fr-0009]
The method according to one of the preceding claims, wherein said access conflict detection sub-method also comprises, if no conflict is detected, a step of determining a first list (NS), ordered , threads to be evaluated sequentially, and a second list (NP) of threads that can be evaluated in parallel, this step being implemented by linearization of said oriented graph.
[10" id="c-fr-0010]
10. The method of claim 9 further comprising a re-run of said simulation, said replay comprising: a parallel evaluation phase of the execution son belonging to said second list; and a sequential evaluation phase of the threads belonging to said first list.
[11" id="c-fr-0011]
11. Computer program product comprising computer executable computer code stored on a computer readable medium and adapted to implement a method according to one of the preceding claims.

类似技术:

公开号 | 公开日 | 专利标题

EP3371719B1|2019-07-17|Electronic system level parallel simulation method with detection of conflicts of access to a shared memory

Kundu et al.2008|Partial order reduction for scalable testing of SystemC TLM designs

JP4201701B2|2008-12-24|Synthesis of verification language

US20160019071A1|2016-01-21|Accurate static dependency analysis via execution-context type prediction

Singhoff et al.2007|AADL modeling and analysis of hierarchical schedulers

Ventroux et al.2016|A new parallel SystemC kernel leveraging manycore architectures

US10585648B2|2020-03-10|Systems and methods for aggregating implicit and explicit event code of executable models

US9176732B2|2015-11-03|Method and apparatus for minimum cost cycle removal from a directed graph

WO2021069626A1|2021-04-15|Method for reproducible parallel simulation at electronic system level implemented by means of a multi-core discrete-event simulation computer system

Madl et al.2007|Performance estimation of distributed real-time embedded systems by discrete event simulations

González2012|Java 7 concurrency cookbook

Moiseev et al.2013|A static analysis approach to data race detection in systemc designs

US10579441B2|2020-03-03|Detecting deadlocks involving inter-processor interrupts

FR2957434A1|2011-09-16|DEVICE FOR TESTING A MULTITASTIC CALCULATION ARCHITECTURE AND CORRESPONDING TEST METHOD

Chang et al.2015|May-happen-in-parallel analysis of ESL models using UPPAAL model checking

EP2956874B1|2017-03-15|Device and method for accelerating the update phase of a simulation kernel

Maraninchi et al.2009|42: Programmable models of computation for the component-based virtual prototyping of heterogeneous embedded systems

Keutzer et al.2005|Using minimal minterms to represent programmability

Sen et al.2014|Hybrid dynamic data race detection in systemC

Busnot2020|Parallel Standard-Compliant SystemC Simulation of Loosely-Timed Transaction Level Models

Ge2014|Property driven verification framework: application to real time property for UML MARTE software design

Herrera et al.2009|Local application of simulation directed for exhaustive coverage of schedulings of systemc specifications

Agosta et al.2014|Fault Tolerance

WO2018076979A1|2018-05-03|Detection method and apparatus for data dependency between instructions

JP2009230667A|2009-10-08|Property generation system and property verification system

同族专利:

公开号 | 公开日

EP3371719A1|2018-09-12|

EP3371719B1|2019-07-17|

US20190057173A1|2019-02-21|

FR3043222B1|2018-11-16|

WO2017076723A1|2017-05-11|

US10943041B2|2021-03-09|

引用文献:

公开号 | 申请日 | 公开日 | 申请人 | 专利标题

EP0438890A2|1990-01-23|1991-07-31|AT&T Corp.|Unboundedly parallel simulations|

US20150058859A1|2013-08-20|2015-02-26|Synopsys, Inc.|Deferred Execution in a Multi-thread Safe System Level Modeling Simulation|

US11119923B2|2017-02-23|2021-09-14|Advanced Micro Devices, Inc.|Locality-aware and sharing-aware cache coherence for collections of processors|

US20200218672A1|2017-07-07|2020-07-09|Micron Technology, Inc.|Rpmb improvements to managed nand|

US10853223B2|2018-01-19|2020-12-01|Arm Limited|Simulation of transactions|

FR3101987B1|2019-10-11|2021-10-01|Commissariat Energie Atomique|Electronic system level reproducible parallel simulation method implemented by means of a multi-core discrete event simulation computer system|

法律状态:
2016-11-30| PLFP| Fee payment|Year of fee payment: 2 |

2017-05-05| PLSC| Search report ready|Effective date: 20170505 |

2017-11-30| PLFP| Fee payment|Year of fee payment: 3 |

2019-11-29| PLFP| Fee payment|Year of fee payment: 5 |

2021-08-06| ST| Notification of lapse|Effective date: 20210705 |

优先权:

申请号 | 申请日 | 专利标题

FR1560551A|FR3043222B1|2015-11-04|2015-11-04|METHOD FOR PARALLEL SIMULATION OF ELECTRONIC SYSTEM LEVEL WITH DETECTION OF CONFLICTS OF ACCESS TO A SHARED MEMORY|

FR1560551|2015-11-04|FR1560551A| FR3043222B1|2015-11-04|2015-11-04|METHOD FOR PARALLEL SIMULATION OF ELECTRONIC SYSTEM LEVEL WITH DETECTION OF CONFLICTS OF ACCESS TO A SHARED MEMORY|

US15/771,073| US10943041B2|2015-11-04|2016-10-26|Electronic system level parallel simulation method with detection of conflicts of access to a shared memory|

PCT/EP2016/075860| WO2017076723A1|2015-11-04|2016-10-26|Electronic system level parallel simulation method with detection of conflicts of access to a shared memory|

EP16793789.5A| EP3371719B1|2015-11-04|2016-10-26|Electronic system level parallel simulation method with detection of conflicts of access to a shared memory|

[返回顶部]